In this notebook we are planning to start with an object that has
already been integrated. This integrated object has a column that
indicates the original batch (currently labeled batch but
will be renamed to “sample”) and a column that indicates cell types
(celltype) in the colData.
We can also grab any other additional sample-level metadata from the metadata file for this dataset. We can use that to help identify groups of interest for differential expression.
In this first section, we explore the integrated object and identify libraries that would be good to use in setting up differential expression.
set.seed(2022)
library(magrittr)
library(SingleCellExperiment)
library(ggplot2)
The merged and integrated datasets are available on S3 right now
(s3://sc-data-integration/scpca), but after deciding what
exactly we will use as input here, we will create the proper setup
scripts and move the data to the server starting with the individual SCE
objects used to create the integrated dataset.
# set up file paths and read in combined SCE and integrated SCE
data_dir <- file.path("..", "setup", "rms")
combined_sce_file <- file.path(data_dir, "dge", "SCPCP000005_merged_sce.rds")
combined_sce <- readr::read_rds(combined_sce_file)
integrated_sce_file <- file.path(data_dir, "dge", "SCPCP000005_integrated_fastmnn_sce.rds")
integrated_sce <- readr::read_rds(integrated_sce_file)
# put the integrated PCA and UMAP into the combined object so all the data is together
reducedDim(combined_sce, "fastMNN_PCA") <- reducedDim(integrated_sce, "fastmnn_PCA")
reducedDim(combined_sce, "fastMNN_UMAP") <- reducedDim(integrated_sce, "fastmnn_UMAP")
reducedDimNames(combined_sce)
[1] "PCA" "UMAP" "fastMNN_PCA" "fastMNN_UMAP"
# file path to sample level metadata
sample_metadata_file <- file.path(data_dir, "rms_sample_metadata.tsv")
# read in sample level metadata to use for DE
sample_metadata <- readr::read_tsv(sample_metadata_file)
Rows: 10 Columns: 10
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (10): project_id, submitter, library_id, sample_id, diagnosis, technology, seq_unit...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s take the sample metadata file and add in the metadata to the
colData of the combined SCE object. While doing this we can
also do some modifying of the columns in the colData, such
as creating a celltype_broad and celltype_fine
column to use for plotting.
# collapse cell types
coldata_df <- colData(combined_sce) %>%
as.data.frame() %>%
tibble::rownames_to_column("cell_id") %>%
# remove any of the subtypes to create new cell type column
dplyr::mutate(celltype_broad = stringr::str_remove(celltype, "-[ABCD]$"),
celltype_fine = celltype) %>%
# merge with sample metadata
dplyr::left_join(sample_metadata, by = c("batch" = "library_id")) %>%
# simplify subdiagnosis for later
dplyr::mutate(diagnosis_groups = factor(
dplyr::case_when(
subdiagnosis == "Alveolar rhabdomyosarcoma" ~ "ARMS",
subdiagnosis == "Embryonal rhabdomyosarcoma" ~ "ERMS"
)))
Error: Column name `cell_id` must not be duplicated.
Run `rlang::last_error()` to see where the error occurred.
Let’s just do some initial exploration of the integrated dataset that we are working with.
# UMAP to show cell type and integration
scater::plotReducedDim(combined_sce,
dimred = "fastMNN_UMAP",
colour_by = "batch",
point_size= 0.5,
point_alpha = 0.2)
# show cell types in the integrated dataset
scater::plotReducedDim(combined_sce,
dimred = "fastMNN_UMAP",
colour_by = "celltype_broad",
point_size= 0.5,
point_alpha = 0.4)
# facet the UMAP by sample to show the distribution of cell types across each sample
scater::plotReducedDim(combined_sce,
dimred = "fastMNN_UMAP",
colour_by = "celltype_broad",
point_size= 0.5,
point_alpha = 0.4,
other_fields = "batch") +
facet_wrap(~ batch, nrow = 3)
Looking at this plot lets us know which cell types belong to which
sample and can help us pick which libraries we might want to consider
for differential expression. From this we see that 6 of the libraries
show a split between Tumor_Mesoderm,
Tumor_Myoblast and Tumor_Myocyte. 3 libraries
are mostly Tumor and they probably weren’t able to be
classified into sub types. Then 1 library has what appears to be all
NA. Let’s just look at the libraries that have myoblast,
mesoderm, and myocytes and see if we can do some differential expression
between those libraries.
# define libraries to keep
scpca_library_ids <- c(
"SCPCL000479",
"SCPCL000480",
"SCPCL000481",
"SCPCL000484",
"SCPCL000488",
"SCPCL000491"
)
# subset sce to only contain batches with library IDs of interest
batches_to_keep <- combined_sce$batch %in% scpca_library_ids
rms_sce <- combined_sce[,batches_to_keep]
Let’s make some of the same plots above and show the distribution of cell types.
# UMAP to show cell type and integration
scater::plotReducedDim(rms_sce,
dimred = "fastMNN_UMAP",
colour_by = "batch",
point_size= 0.5,
point_alpha = 0.2)
scater::plotReducedDim(rms_sce,
dimred = "fastMNN_UMAP",
colour_by = "celltype_broad",
point_size= 0.5,
point_alpha = 0.4,
other_fields = "batch") +
facet_wrap(~ batch)
It looks like the pink cells, or the Tumor_Myoblast are
the most dominant cell type across the libraries. This might be an
interesting group to start with for differential expression. We can
confirm this by looking at the total numbers of cells in each cell
type.
as.data.frame(colData(rms_sce)) %>%
dplyr::count(celltype_broad)
# can also print out a summary of the cell types across each library
colData(rms_sce) %>%
as.data.frame() %>%
dplyr::count(batch, celltype_broad) %>%
dplyr::group_split(batch)
<list_of<
tbl_df<
batch : character
celltype_broad: character
n : integer
>
>[6]>
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
[[6]]
NA
Tumor_Myoblast is definitely the top represented cell
type across the libraries. So we want to look at myoblasts across our
libraries and see if there are any changes in the myoblast population
between samples of different types. To do this, we should probably look
at our sample metadata to see what types of samples we are looking
at.
# Let's look at some of the clinical sample metadata that we have
table(rms_sce$diagnosis) # everything is RMS
Rhabdomyosarcoma
26544
table(rms_sce$subdiagnosis) # sample is split between ERMS and ARMS
Alveolar rhabdomyosarcoma Embryonal rhabdomyosarcoma
14611 11933
# how many of each, 3 of each subdiagnosis, that's enough replicates so let's compare!
colData(rms_sce) %>%
as.data.frame() %>%
dplyr::count(batch, subdiagnosis)
It looks like we have three samples that are from ARMS patients and three samples that are from ERMS patients.
A good question we could ask is if there are differences in gene expression patterns in the Myoblast population between ARMS and ERMS patients. If we want to ask that question, then we need to have replicates, preferably at least 3 samples for each type of patient, which we do!
Let’s go ahead and try it!
Before we can do DE analysis to compare the gene expression profiles of myoblasts in ARMS vs. ERMS patients, we will need to “pseudo-bulk” the gene counts. Pseudo-bulking creates a new counts matrix that contains the total counts across all cells of a given label (e.g. cell type) for each sample. This allows us to use single-cell resolution to define the labels and sum gene counts across groups of cells containing the same label, and avoids counting each cell as its own replicate.
Below are a few reasons why pseudo-bulking is helpful:
Let’s start with the simple comparison of looking at one cell type and comparing across ERMS and ARMS subtypes. When we do this, we will first want to pseudo-bulk our dataset to group our cells by cell type and by sample. Then we can subset the pseudo-bulked dataset to just contain our cell type of interest before proceeding.
# first let's subset our coldata to only have the columns we care about in pseudobulking
pseudobulked_coldata <- colData(rms_sce)[, c("celltype_broad", "batch")]
# this creates a new SCE object that contains the pseudobulked counts across the provided groups
pseudobulked_sce <- scater::aggregateAcrossCells(rms_sce,
id = pseudobulked_coldata)
# column names aren't automatically added, so let's add them in
colnames(pseudobulked_sce) <- paste(pseudobulked_sce$celltype_broad,
pseudobulked_sce$batch, sep = "-")
pseudobulked_sce
class: SingleCellExperiment
dim: 573 37
metadata(202): salmon_version-SCPCL000478 reference_index-SCPCL000478 ...
variable_genes subset_hvg
assays(1): counts
rownames(573): ENSG00000278996 ENSG00000150672 ... ENSG00000135932
ENSG00000079805
rowData names(30): gene_symbol-SCPCL000478 mean-SCPCL000478 ... mean-SCPCL000498
detected-SCPCL000498
colnames(37): Fibroblast-SCPCL000488 Lymphocyte-SCPCL000484 ... Vascular
Endothelium-SCPCL000488 Vascular Endothelium-SCPCL000491
colData names(23): cell_id sum ... batch ncells
reducedDimNames(4): PCA UMAP fastMNN_PCA fastMNN_UMAP
mainExpName: NULL
altExpNames(0):
# only 37 columns in the counts assay
counts(pseudobulked_sce)[1:10, 1:10]
Fibroblast-SCPCL000488 Lymphocyte-SCPCL000484 Lymphocyte-SCPCL000488
ENSG00000278996 79 0 271
ENSG00000150672 121 9 50
ENSG00000157168 9 4 33
ENSG00000161671 56 38 194
ENSG00000258399 96 28 232
ENSG00000185532 1428 33 76
ENSG00000157445 52 28 93
ENSG00000198947 415 1 170
ENSG00000245532 631 4 458
ENSG00000155974 8 26 51
Monocyte-SCPCL000479 Monocyte-SCPCL000480 Monocyte-SCPCL000481
ENSG00000278996 51 224 76
ENSG00000150672 6 23 430
ENSG00000157168 0 14 342
ENSG00000161671 3 14 11
ENSG00000258399 4 6 34
ENSG00000185532 6 72 140
ENSG00000157445 22 2 0
ENSG00000198947 4 4 38
ENSG00000245532 67 358 471
ENSG00000155974 2 15 22
Monocyte-SCPCL000484 Monocyte-SCPCL000488 Monocyte-SCPCL000491
ENSG00000278996 43 457 262
ENSG00000150672 34 235 455
ENSG00000157168 2 112 1
ENSG00000161671 226 696 85
ENSG00000258399 33 813 0
ENSG00000185532 13 287 11
ENSG00000157445 16 265 41
ENSG00000198947 18 469 25
ENSG00000245532 265 2513 699
ENSG00000155974 9 96 32
Muscle-SCPCL000480
ENSG00000278996 209
ENSG00000150672 54
ENSG00000157168 33
ENSG00000161671 53
ENSG00000258399 55
ENSG00000185532 1208
ENSG00000157445 2
ENSG00000198947 44
ENSG00000245532 184
ENSG00000155974 23
Let’s take a look at what the colData now looks like in
the pseudo-bulked SCE object.
# note the new addition of the column with number of cells per group
colData(pseudobulked_sce)
DataFrame with 37 rows and 23 columns
cell_id sum detected subsets_mito_sum
<character> <numeric> <integer> <numeric>
Fibroblast-SCPCL000488 NA NA NA NA
Lymphocyte-SCPCL000484 NA NA NA NA
Lymphocyte-SCPCL000488 NA NA NA NA
Monocyte-SCPCL000479 NA NA NA NA
Monocyte-SCPCL000480 NA NA NA NA
... ... ... ... ...
Vascular Endothelium-SCPCL000480 NA NA NA NA
Vascular Endothelium-SCPCL000481 NA NA NA NA
Vascular Endothelium-SCPCL000484 NA NA NA NA
Vascular Endothelium-SCPCL000488 NA NA NA NA
Vascular Endothelium-SCPCL000491 NA NA NA NA
subsets_mito_detected subsets_mito_percent
<integer> <numeric>
Fibroblast-SCPCL000488 NA NA
Lymphocyte-SCPCL000484 NA NA
Lymphocyte-SCPCL000488 NA NA
Monocyte-SCPCL000479 NA NA
Monocyte-SCPCL000480 NA NA
... ... ...
Vascular Endothelium-SCPCL000480 NA NA
Vascular Endothelium-SCPCL000481 NA NA
Vascular Endothelium-SCPCL000484 NA NA
Vascular Endothelium-SCPCL000488 NA NA
Vascular Endothelium-SCPCL000491 NA NA
celltype batch celltype_broad
<factor> <character> <character>
Fibroblast-SCPCL000488 Fibroblast SCPCL000488 Fibroblast
Lymphocyte-SCPCL000484 Lymphocyte SCPCL000484 Lymphocyte
Lymphocyte-SCPCL000488 Lymphocyte SCPCL000488 Lymphocyte
Monocyte-SCPCL000479 Monocyte SCPCL000479 Monocyte
Monocyte-SCPCL000480 Monocyte SCPCL000480 Monocyte
... ... ... ...
Vascular Endothelium-SCPCL000480 Vascular Endothelium SCPCL000480 Vascular Endothelium
Vascular Endothelium-SCPCL000481 Vascular Endothelium SCPCL000481 Vascular Endothelium
Vascular Endothelium-SCPCL000484 Vascular Endothelium SCPCL000484 Vascular Endothelium
Vascular Endothelium-SCPCL000488 Vascular Endothelium SCPCL000488 Vascular Endothelium
Vascular Endothelium-SCPCL000491 Vascular Endothelium SCPCL000491 Vascular Endothelium
celltype_fine project_id submitter sample_id
<factor> <character> <character> <character>
Fibroblast-SCPCL000488 Fibroblast SCPCP000005 dyer_chen SCPCS000262
Lymphocyte-SCPCL000484 Lymphocyte SCPCP000005 dyer_chen SCPCS000258
Lymphocyte-SCPCL000488 Lymphocyte SCPCP000005 dyer_chen SCPCS000262
Monocyte-SCPCL000479 Monocyte SCPCP000005 dyer_chen SCPCS000253
Monocyte-SCPCL000480 Monocyte SCPCP000005 dyer_chen SCPCS000254
... ... ... ... ...
Vascular Endothelium-SCPCL000480 Vascular Endothelium SCPCP000005 dyer_chen SCPCS000254
Vascular Endothelium-SCPCL000481 Vascular Endothelium SCPCP000005 dyer_chen SCPCS000255
Vascular Endothelium-SCPCL000484 Vascular Endothelium SCPCP000005 dyer_chen SCPCS000258
Vascular Endothelium-SCPCL000488 Vascular Endothelium SCPCP000005 dyer_chen SCPCS000262
Vascular Endothelium-SCPCL000491 Vascular Endothelium SCPCP000005 dyer_chen SCPCS000265
diagnosis technology seq_unit
<character> <character> <character>
Fibroblast-SCPCL000488 Rhabdomyosarcoma 10Xv3 nucleus
Lymphocyte-SCPCL000484 Rhabdomyosarcoma 10Xv3 nucleus
Lymphocyte-SCPCL000488 Rhabdomyosarcoma 10Xv3 nucleus
Monocyte-SCPCL000479 Rhabdomyosarcoma 10Xv3 nucleus
Monocyte-SCPCL000480 Rhabdomyosarcoma 10Xv3 nucleus
... ... ... ...
Vascular Endothelium-SCPCL000480 Rhabdomyosarcoma 10Xv3 nucleus
Vascular Endothelium-SCPCL000481 Rhabdomyosarcoma 10Xv3 nucleus
Vascular Endothelium-SCPCL000484 Rhabdomyosarcoma 10Xv3 nucleus
Vascular Endothelium-SCPCL000488 Rhabdomyosarcoma 10Xv3 nucleus
Vascular Endothelium-SCPCL000491 Rhabdomyosarcoma 10Xv3 nucleus
subdiagnosis tissue_location disease_timing
<character> <character> <character>
Fibroblast-SCPCL000488 Alveolar rhabdomyosa.. Calf Initial diagnosis
Lymphocyte-SCPCL000484 Alveolar rhabdomyosa.. Thigh Initial diagnosis
Lymphocyte-SCPCL000488 Alveolar rhabdomyosa.. Calf Initial diagnosis
Monocyte-SCPCL000479 Embryonal rhabdomyos.. Prostate Recurrence
Monocyte-SCPCL000480 Embryonal rhabdomyos.. Prostate/bladder Recurrence
... ... ... ...
Vascular Endothelium-SCPCL000480 Embryonal rhabdomyos.. Prostate/bladder Recurrence
Vascular Endothelium-SCPCL000481 Embryonal rhabdomyos.. Pelvis Recurrence
Vascular Endothelium-SCPCL000484 Alveolar rhabdomyosa.. Thigh Initial diagnosis
Vascular Endothelium-SCPCL000488 Alveolar rhabdomyosa.. Calf Initial diagnosis
Vascular Endothelium-SCPCL000491 Alveolar rhabdomyosa.. Chest Recurrence
diagnosis_groups celltype_broad batch
<factor> <character> <character>
Fibroblast-SCPCL000488 ARMS Fibroblast SCPCL000488
Lymphocyte-SCPCL000484 ARMS Lymphocyte SCPCL000484
Lymphocyte-SCPCL000488 ARMS Lymphocyte SCPCL000488
Monocyte-SCPCL000479 ERMS Monocyte SCPCL000479
Monocyte-SCPCL000480 ERMS Monocyte SCPCL000480
... ... ... ...
Vascular Endothelium-SCPCL000480 ERMS Vascular Endothelium SCPCL000480
Vascular Endothelium-SCPCL000481 ERMS Vascular Endothelium SCPCL000481
Vascular Endothelium-SCPCL000484 ARMS Vascular Endothelium SCPCL000484
Vascular Endothelium-SCPCL000488 ARMS Vascular Endothelium SCPCL000488
Vascular Endothelium-SCPCL000491 ARMS Vascular Endothelium SCPCL000491
ncells
<integer>
Fibroblast-SCPCL000488 62
Lymphocyte-SCPCL000484 11
Lymphocyte-SCPCL000488 127
Monocyte-SCPCL000479 21
Monocyte-SCPCL000480 76
... ...
Vascular Endothelium-SCPCL000480 119
Vascular Endothelium-SCPCL000481 59
Vascular Endothelium-SCPCL000484 13
Vascular Endothelium-SCPCL000488 269
Vascular Endothelium-SCPCL000491 67
Before we can perform DGE analysis, we want to do some quality control like filtering out any groups in the pseudo-bulked dataset that may have a low number of cells.
# filter by number of cells
filtered_pseudobulked_sce <- pseudobulked_sce[, pseudobulked_sce$ncells >= 10]
We also want to filter out any genes that may be lowly expressed in our pseudo-bulked dataset.
# filter out genes that might be lowly expressed using filterbyExpr in `edgeR`
# provide sample groups from SCE to determine minimum sample size that genes should be expressed in
# we are grouping samples as ERMS vs. ARMS so use subdiagnosis
genes_filter <- edgeR::filterByExpr(filtered_pseudobulked_sce,
group = filtered_pseudobulked_sce$subdiagnosis)
filtered_pseudobulked_sce <- filtered_pseudobulked_sce[genes_filter,]
# looks like this doesn't actually filter anything though, but we probably want to keep this step in? it seems important
Maybe we should explore changing defaults in instruction? Adding the defaults here for future reference.
min.count = 10: Minimum count required for at least
some samples.min.total.count = 15: Minimum total count
required.large.n = 10: Number of samples per group that is
considered to be “large”.min.prop = 0.7: Minimum proportion of samples in the
smallest group that express the gene.The last filtering will be to filter to just the sub type of cells that we are interested in.
tumor_myoblast_sce <- filtered_pseudobulked_sce[, filtered_pseudobulked_sce$celltype_broad == "Tumor_Myoblast"]
table(tumor_myoblast_sce$celltype_broad)
Tumor_Myoblast
6
We should now see that there are only 6 samples left that all have
the cell type Tumor_Myoblast.
edgeRNote that at this point we can treat our data set like bulk RNA-seq
samples where each library is a sample and use the same methods we would
use for bulk, such as edgeR and DESeq2. First
let’s try and do some DE analysis with edgeR.
dge_list <- edgeR::DGEList(counts(tumor_myoblast_sce),
samples = colData(tumor_myoblast_sce))
# make column names equal to batch ids, helpful for plotting later
colnames(dge_list) <- dge_list$samples$batch
dge_list
An object of class "DGEList"
$counts
SCPCL000479 SCPCL000480 SCPCL000481 SCPCL000484 SCPCL000488 SCPCL000491
ENSG00000278996 3328 3214 3469 5496 8004 13768
ENSG00000150672 2772 7808 139767 7272 7892 44982
ENSG00000157168 3619 5494 87515 2693 6692 2034
ENSG00000161671 536 4758 4590 40668 31449 21699
ENSG00000258399 5464 4875 8841 16016 49919 218
568 more rows ...
$samples
NA
Now we need to normalize these counts!
Use the trimmed mean of M-values method to compute normalization factors. Counts are now large enough to apply bulk normalization methods rather than the deconvolution method that was used on individual libraries.
# compute some normalization factors
dge_list <- edgeR::calcNormFactors(dge_list)
# should now see a new column, `norm.factors`
dge_list$samples
Before we do any DGE, let’s do some QC checks to make sure our data looks okay. We make some Mean difference plots, but I’m not really convinced we get a whole lot from these?
# mean difference plots
purrr::walk(1:ncol(dge_list),
~ limma::plotMD(dge_list, column = .x))
# MDS scaling plot
# first compute counts per million
counts_per_million <- edgeR::cpm(dge_list, log = TRUE)
limma::plotMDS(counts_per_million,
# add some colors based on our two groups
col = ifelse(dge_list$samples$diagnosis_groups == "ARMS", "black", "red"))
Let’s do some DE now!
# set up design matrix to compare between ARMS and ERMS groups
design_matrix <- model.matrix(~factor(subdiagnosis),
dge_list$samples)
design_matrix
(Intercept) factor(subdiagnosis)Embryonal rhabdomyosarcoma
SCPCL000479 1 1
SCPCL000480 1 1
SCPCL000481 1 1
SCPCL000484 1 0
SCPCL000488 1 0
SCPCL000491 1 0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`factor(subdiagnosis)`
[1] "contr.treatment"
Estimate dispersions using both the quasi-likelihood(QL) dispersion and the negative binomial(NB) dispersions.
# estimate NB dispersions
dge_list <- edgeR::estimateDisp(dge_list, design = design_matrix)
summary(dge_list$trended.dispersion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2358 0.2627 0.2665 0.2881 0.3085 0.4846
edgeR::plotBCV(dge_list)
# estimate the quasi-likelihood dispersions
glm_fit <- edgeR::glmQLFit(dge_list, design_matrix, robust = TRUE)
edgeR::plotQLDisp(glm_fit)
Test for differences using the glmQLFTest where
differentially expressed gens are defined as those with non-zero LFC and
a FDR of < 5%.
results <- edgeR::glmQLFTest(glm_fit, coef = ncol(design_matrix))
summary(limma::decideTests(results))
factor(subdiagnosis)Embryonal rhabdomyosarcoma
Down 30
NotSig 524
Up 19
edger_results <- edgeR::topTags(results, n="Inf")$table
edger_results_sig <- edger_results %>%
dplyr::filter(FDR <= 0.05)
# print out results
edger_results_sig
Let’s do the same thing, but try with DESeq2.
# deseq2 requires the coldata object as a separate input
coldata_df <- as.data.frame(colData(tumor_myoblast_sce))
# set up the deseq object
deseq_object <- DESeq2::DESeqDataSetFromMatrix(counts(tumor_myoblast_sce),
colData = coldata_df,
design = ~ diagnosis_groups)
converting counts to integer mode
Use the median of ratios method for count normalization followed by regularized log transformation.
# we get a warning if we don't use local that it can't find a model
# this could likely be due to batch effects present in the data
normalized_object <- DESeq2::rlog(deseq_object, blind = TRUE, fit = 'local')
Evaluate QC here using PCA. Here we can label by the diagnosis groups to show that the highest amount of variance is due to different diagnosis types.
DESeq2::plotPCA(normalized_object, intgroup = "diagnosis_groups")
We can also label by other metadata to see if there are other factors that might be related to variation.
DESeq2::plotPCA(normalized_object, intgroup = "disease_timing")
# run DESeq
# use the same fit type as with rlog
deseq_object <- DESeq2::DESeq(deseq_object, fitType = 'local')
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Next we take a look at the dispersions, we expect to see dispersions decrease as means are increasing and follow the line of best fit. Here dispersions are increasing as mean is increasing but are following the line of best fit, so not the most ideal… perhaps underlying batch effects are here and affecting this?
plotDispEsts(deseq_object)
deseq_results <- DESeq2::results(deseq_object, alpha = 0.05)
DESeq2::resultsNames(deseq_object)
[1] "Intercept" "diagnosis_groups_ERMS_vs_ARMS"
shrink_results <- DESeq2::lfcShrink(deseq_object, res = deseq_results, coef = 2, type = "apeglm")
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
sequence count data: removing the noise and preserving large differences.
Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
# look at what actually happened when we applyed shrinkage (adjustment of large fold changes that are overestimated)
comparison_df <- data.frame(
lfc_original = deseq_results$log2FoldChange,
lfc_shrunken = shrink_results$log2FoldChange,
logmean = log10(deseq_results$baseMean)
)
ggplot(comparison_df,
aes(x = lfc_original,
y = lfc_shrunken,
color = logmean)) +
geom_point(alpha = 0.1) +
theme_bw() +
scale_color_viridis_c() +
coord_cartesian(xlim = c(-10,10), ylim = c(-10,10))
Let’s get a table of the genes that are significant.
deseq_results <- shrink_results %>%
as.data.frame() %>%
tibble::rownames_to_column("ensembl_id")
deseq_results_sig <- deseq_results %>%
dplyr::filter(padj <= 0.05)
deseq_results_sig
Before we move on, let’s compare the results between the two types of
DEG analysis we did. Here we can look at the correlation between the
logFC calculated with edgeR and the logFC
calculated with DESeq2, which we expect to be similar. We can then see
which genes are calculated as significant with each method and how they
compare.
edger_results <- edger_results %>%
tibble::rownames_to_column("ensembl_id") %>%
dplyr::select(ensembl_id, logFC, FDR) %>%
dplyr::rename("edgeR_LogFC" = "logFC",
"edgeR_padj" = "FDR")
combined_results <- deseq_results %>%
dplyr::select(ensembl_id, log2FoldChange, padj) %>%
dplyr::rename("DESeq_LogFC" = "log2FoldChange",
"DESeq_padj" = "padj") %>%
dplyr::inner_join(edger_results, by = c("ensembl_id")) %>%
dplyr::mutate(
significance = dplyr::case_when(DESeq_padj <= 0.05 & edgeR_padj > 0.05 ~ "DESeq",
DESeq_padj > 0.05 & edgeR_padj <= 0.05 ~ "edgeR",
DESeq_padj <= 0.05 & edgeR_padj <= 0.05 ~ "both",
DESeq_padj > 0.05 & edgeR_padj > 0.05 ~ "None")
)
ggplot(combined_results, aes(x = DESeq_LogFC, y = edgeR_LogFC, color = significance)) +
geom_point(alpha = 0.5) +
theme_bw()
Here we see that it looks like edgeR takes a more
conservative approach and that any genes significant in
edgeR are also significant in DESeq. We also
see that there’s an extra subset of genes (with lower abs(log fold
changes)) that are identified as differentially expressed in
DESeq only.
Let’s also do some initial analysis to look at what types of genes are being identified and see if what is being identified as actually up/down regulated appear to correlate back to the single cell data.
The first plot we’ll make is a volcano plot and we can label the significant genes with their gene symbols. For now, let’s use the DESeq results.
deseq_df <- shrink_results %>%
as.data.frame() %>%
tibble::rownames_to_column("ensembl_id")
sce_rowdata_df <- rowData(tumor_myoblast_sce) %>%
as.data.frame() %>%
tibble::rownames_to_column("ensembl_id")
deseq_df <- deseq_df %>%
dplyr::left_join(sce_rowdata_df, by = "ensembl_id")
EnhancedVolcano::EnhancedVolcano(deseq_df,
x = 'log2FoldChange', # fold change statistic to plot
y = 'pvalue', # significance values
lab = deseq_df$gene_symbol.SCPCL000478, # labels for points
pCutoff = 1e-05, # The p value cutoff we will use (default)
FCcutoff = 1, # The fold change cutoff (default)
title = NULL, # no title
subtitle = NULL, # or subtitle
caption = NULL, # or caption
labSize = 3 # smaller labels
) +
# change the overall theme
theme_classic() +
# move the legend to the bottom
theme(legend.position = "bottom")
Let’s make some UMAPs where we plot a gene that’s identified to be differentially expressed in a given cell type and see what the expression of that gene is across samples.
# add column to colData with expression of gene of interest
# here we will actually put all of the significant genes so then we can plot any of them
sig_expressed_genes <- deseq_df %>%
dplyr::filter(pvalue < 1e-5,
abs(log2FoldChange) > 2) %>%
dplyr::pull(ensembl_id)
sig_genes_counts <- logcounts(combined_sce[sig_expressed_genes,]) %>%
as.matrix() %>%
t() %>%
as.data.frame() %>%
dplyr::mutate(cell_id = colnames(combined_sce))
combined_coldata_df <- as.data.frame(colData(combined_sce)) %>%
dplyr::left_join(sig_genes_counts, by = "cell_id")
# add back modified coldata containing gene expression for sig genes
colData(combined_sce) <- DataFrame(combined_coldata_df, row.names = combined_coldata_df$cell_id)
The first plot we will make is with SOX5 which is found in both
edgeR and DESeq
# filter to just myoblast cells and remove any NA's before plotting
myoblast_combined_sce <- combined_sce[, which(combined_sce$celltype_broad == "Tumor_Myoblast")]
# first look at just ARMS vs. ERMS
scater::plotReducedDim(myoblast_combined_sce,
dimred = "fastMNN_UMAP",
colour_by = "ENSG00000134532", #SOX5
point_size= 0.5,
point_alpha = 0.4,
other_fields = "diagnosis_groups") +
facet_wrap(~ diagnosis_groups, nrow = 3)
What about some genes that are differentially expressed in only
DESeq but not edgeR?
# pick out the top genes that are identified as DE in only DESeq and plot those
combined_results %>%
dplyr::filter(significance == "DESeq") %>%
dplyr::arrange(DESeq_LogFC)
# top downregulated gene
scater::plotReducedDim(myoblast_combined_sce,
dimred = "fastMNN_UMAP",
colour_by = "ENSG00000182197", #EXT1
point_size= 0.5,
point_alpha = 0.4,
other_fields = c("diagnosis_groups", "batch")) +
facet_wrap(~ batch + diagnosis_groups, nrow = 3)
scater::plotReducedDim(myoblast_combined_sce,
dimred = "fastMNN_UMAP",
colour_by = "ENSG00000161671", #EXT1
point_size= 0.5,
point_alpha = 0.4,
other_fields = c("diagnosis_groups", "batch")) +
facet_wrap(~ batch + diagnosis_groups, nrow = 3)
We can also look at other cell types in our dataset and set up the
same comparison across disease states to identify differentially
expressed genes. For simplicity we will just use DESeq2
here.
# pick out the next most prominent cell type to use
colData(filtered_pseudobulked_sce) %>%
as.data.frame() %>%
dplyr::group_by(celltype_broad) %>%
dplyr::summarise(total_cells = sum(ncells))
From this table, it looks like the tumor sub types are going to be the most promising.
We’re going to repeat the same analysis we did above just using the myocytes instead. I couldn’t copy and paste since we were doing it again, so here’s a function to repeat the analysis and we can run it across all cell types of interest in the analysis.
run_deseq <- function(sce, celltype){
# subset to cell type of interest
celltype_sce <- sce[, sce$celltype_broad == celltype]
coldata_df <- as.data.frame(colData(celltype_sce))
# create deseq object
deseq_object <- DESeq2::DESeqDataSetFromMatrix(counts(celltype_sce),
colData = coldata_df,
design = ~ diagnosis_groups)
# perform deseq
deseq_object <- DESeq2::DESeq(deseq_object, fitType = 'local')
# extract results
deseq_results <- DESeq2::results(deseq_object,
alpha = 0.05)
# shrink and set up data frame
shrink_results <- DESeq2::lfcShrink(deseq_object,
res = deseq_results,
coef = 2,
type = "apeglm") %>%
as.data.frame() %>%
tibble::rownames_to_column("ensembl_id")
# get rowdata from sce to get gene symbol
sce_rowdata_df <- rowData(celltype_sce) %>%
as.data.frame() %>%
tibble::rownames_to_column("ensembl_id") %>%
# only select columns we want, the gene symbol column will be modified in the setup script
dplyr::select(ensembl_id, gene_symbol.SCPCL000478) %>%
dplyr::rename(gene_symbol = gene_symbol.SCPCL000478)
# join together by ensembl id, adding in rowdata includes gene symbol
deseq_results_df <- shrink_results %>%
dplyr::left_join(sce_rowdata_df, by = "ensembl_id")
return(deseq_results_df)
}
We already added the counts for some genes to the
combined_sce object so we can use that to make some plots.
Starting with FOXO3, in the same family as the fusion
partner (PAX3/PAX7-FOXO1) found in ARMS.
# subset to just tumor celltypes that we are interested in
tumor_sce <- combined_sce[, which(combined_sce$celltype_broad %in% celltypes)]
# violin plot across cell types
scater::plotExpression(tumor_sce,
features = "ENSG00000118689", #FOXO3
x = "diagnosis_groups",
colour_by = "diagnosis_groups",
other_fields = "celltype_broad",
point_size = 0.1) +
facet_wrap(~ celltype_broad)
As the differential expression results showed, FOXO1 is
up regulated in ARMS in mesoderm cells and myoblasts, but appears to be
almost equally highly expressed in myocytes in both disease sub
types.
In addition to looking at differential expression of genes across cell types, we can also compare the distribution of cell types in each sample and compare across conditions. In this analysis, we quantify the number of cells assigned to each cell type and measure if the abundance of cells assigned to a specific label changes in conditions (e.g. in ARMS vs. ERMS).
The example that we are going to use is through the
edgeR package following the guidance in Orchestrating
Single Cell Analysis.
Here we need to set up an object where the counts are not reads per gene, but cells per label.
# first remove any samples with NA
filtered_combined_sce <- combined_sce[, !is.na(combined_sce$celltype_broad)]
# create a table of cell abundances
# treat each cell type as a gene
cell_abundance <- table(filtered_combined_sce$celltype_broad, filtered_combined_sce$batch) %>%
unclass()
# use original sample metadata to grab diagnosis for each library
sample_metadata <- sample_metadata %>%
dplyr::select(library_id, subdiagnosis) %>%
dplyr::mutate(diagnosis_groups = factor(
dplyr::case_when(
subdiagnosis == "Alveolar rhabdomyosarcoma" ~ "ARMS",
subdiagnosis == "Embryonal rhabdomyosarcoma" ~ "ERMS"
))) %>%
# filter out any NA cell types
dplyr::filter(library_id %in% colnames(cell_abundance)) %>%
DataFrame()
rownames(sample_metadata) <- sample_metadata$library_id
# set up DGElist
abundance_dge_list <- edgeR::DGEList(cell_abundance, samples = sample_metadata)
Filter out any labels that are low abundance.
celltypes_keep <- edgeR::filterByExpr(abundance_dge_list, group = abundance_dge_list$samples$diagnosis_groups)
abundance_dge_list <- abundance_dge_list[celltypes_keep, ]
Set up the design matrix the same way you would with differential expression analysis and perform the analysis.
design_matrix <- model.matrix(~factor(diagnosis_groups), abundance_dge_list$samples)
abundance_dge_list <- edgeR::estimateDisp(abundance_dge_list, design_matrix, trend = "none")
abundance_fit <- edgeR::glmQLFit(abundance_dge_list, robust = TRUE, abundance.trend = FALSE)
abundance_results <- edgeR::glmQLFTest(abundance_fit, coef = 2)
# print out results and summary
abundance_results
An object of class "DGELRT"
$coefficients
(Intercept) factor(diagnosis_groups)ERMS
Lymphocyte -4.777493 -0.5206590
Monocyte -2.937735 -1.1421817
Tumor_Mesoderm -4.103203 2.0360792
Tumor_Myoblast -1.140934 0.5505157
Tumor_Myocyte -2.100419 -0.2827945
Vascular Endothelium -3.614289 -0.4755841
$fitted.values
SCPCL000478 SCPCL000479 SCPCL000480 SCPCL000481 SCPCL000484
Lymphocyte 16.25586 10.65656 22.21319 26.08197 24.30173
Monocyte 55.18556 36.17701 75.40957 88.54336 82.49979
Tumor_Mesoderm 413.62008 271.14953 565.20064 663.63943 618.34240
Tumor_Myoblast 1811.34781 1187.43296 2475.15774 2906.24633 2707.87909
Tumor_Myocyte 301.50170 197.65009 411.99391 483.74929 450.73075
Vascular Endothelium 54.63790 35.81799 74.66121 87.66466 81.68107
SCPCL000488 SCPCL000491 SCPCL000495 SCPCL000498
Lymphocyte 49.24475 30.61228 40.90586 44.21122
Monocyte 310.87677 193.25202 258.23427 279.10061
Tumor_Mesoderm 96.80953 60.18023 80.41623 86.91418
Tumor_Myoblast 1875.52465 1165.89258 1557.93155 1683.81855
Tumor_Myocyte 718.39266 446.57833 596.74321 644.96241
Vascular Endothelium 157.95726 98.19183 131.20947 141.81171
$deviance
Lymphocyte Monocyte Tumor_Mesoderm Tumor_Myoblast
10.674925 1.135274 11.560367 14.031233
Tumor_Myocyte Vascular Endothelium
13.871089 1.197436
$method
[1] "oneway"
$unshrunk.coefficients
(Intercept) factor(diagnosis_groups)ERMS
Lymphocyte -4.780807 -0.5229791
Monocyte -2.938213 -1.1433249
Tumor_Mesoderm -4.104864 2.0375729
Tumor_Myoblast -1.140966 0.5505533
Tumor_Myocyte -2.100594 -0.2828701
Vascular Endothelium -3.615285 -0.4762262
$df.residual
[1] 7 7 7 7 7 7
$design
(Intercept) factor(diagnosis_groups)ERMS
SCPCL000478 1 1
SCPCL000479 1 1
SCPCL000480 1 1
SCPCL000481 1 1
SCPCL000484 1 1
SCPCL000488 1 0
SCPCL000491 1 0
SCPCL000495 1 0
SCPCL000498 1 0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`factor(diagnosis_groups)`
[1] "contr.treatment"
$offset
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 8.092239 7.669962 8.404472 8.565031 8.494334 8.67761 8.202208 8.49208 8.569786
attr(,"class")
[1] "CompressedMatrix"
attr(,"Dims")
[1] 6 9
attr(,"repeat.row")
[1] TRUE
attr(,"repeat.col")
[1] FALSE
$dispersion
[1] 3.863155
$prior.count
[1] 0.125
$df.residual.zeros
[1] 7 7 7 7 7 7
$df.prior
[1] 2.532037 2.532037 2.532037 2.532037 2.532037 2.532037
$var.post
Lymphocyte Monocyte Tumor_Mesoderm Tumor_Myoblast
1.2796751 0.2788763 1.3725662 1.6317832
Tumor_Myocyte Vascular Endothelium
1.6149826 0.2853977
$var.prior
[1] 0.6014862
$samples
$AveLogCPM
[1] 12.76116 15.02523 16.25104 18.77918 16.69355 14.40420
$table
$comparison
[1] "factor(diagnosis_groups)ERMS"
$df.test
[1] 1 1 1 1 1 1
$df.total
[1] 9.532037 9.532037 9.532037 9.532037 9.532037 9.532037
summary(decideTests(abundance_results))
factor(diagnosis_groups)ERMS
Down 0
NotSig 6
Up 0
It doesn’t look like any of the populations are differentially abundant between the two conditions. I did find this statement from the paper, “ARMS tumors contained significantly fewer cells with the mesoderm gene expression signature and were skewed towards the myocyte signature.” However, they do show quite a bit of variability in the supplemental figures across the samples and they only show them for the PDX samples while the ones we are analyzing are the patient samples. We do see a positive fold change trending towards an increase in the mesoderm population in ERMS and a decrease in myocytes in ERMS, aligning with the findings in the paper (although not significant with these samples).
edgeR
and DESeq, but we probably don’t need to actually do that
in class, we should pick one and show how you would do the analysis
first to look at one cell type between experimental conditions and then
across multiple cell types/ labelsedgeR is a bit more conservative in its approach to
identify differentially expressed genessessionInfo()